FT-NIRS Otolith Age Prediction

This RMD serves as a guide/lab journal documenting my methods for processing and modelling FT-NIRS otolith spectra from juvenile Pacific cod (Gadus macrocephalus) in the Gulf of Alaska. Scans were acquired on a Bruker MPA II with an integrating sphere, a 5 mm Teflon disk taped over the scanning window, and a gold transflectance stamp placed over the top of each sample. Spectra were collected between 11,500 and 4,000 cm-1 at a resolution of 16 cm-1 with 64 replicate scans.

LPW scans refer to specimens collected near Little Port Walter (LPW), a NOAA research station near the southern tip of Baranof Island in Southeast Alaska. Juvenile cod were collected using beach and purse seines from embayments near LPW on July 27-28, 2020, then transferred into outdoor net pens at LPW to be reared in captivity. Specimens were sampled approximately weekly (ADD MORE IN HERE LATER)

Load packages, import .0 files, create spectral dataframe

This segment is adapted from code provided by Dr. Esther Goldstein at NOAA’s AFSC in Seattle. (ADD INFO ON PACKAGES INCLUDED?)

Load necessary packages
# Install packages not yet installed
packages <- c("dplyr", "tidyr", "EMSC", "purrr", "pls", "devtools", "hyperSpec", "prospectr", "data.table", "mdatools", "splitstackshape", "ggplot2", "viridis") # opusreader2 is not on CRAN; it is installed from GitHub below
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  utils::install.packages(pkgs = packages[!installed_packages])
}
invisible(lapply(packages, library, character.only = TRUE)) # load all packages in list

# install packages not on CRAN
if (!require("remotes")) install.packages("remotes")
if (!require("opusreader2")) {
  remotes::install_github("spectral-cockpit/opusreader2")
  library("opusreader2")
}
if (!require("simplerspec")) {
  remotes::install_github("philipp-baumann/simplerspec")
  library("simplerspec")
}

rm(installed_packages) # remove objects from environment
rm(packages)

Importing FT-NIRS Spectral Files (Opus .0) & Building Dataframe

The Bruker MPA outputs FT-NIRS spectra in a .0 format. Prior to loading spectral data, a metadata .csv (LPW_metadata.csv) was generated to pair each specimen's biological data with its respective FT-NIRS scans.

LPW Metadata Variables:
  • Specimen # in dataset (specimen)
  • Fish fork length (to nearest whole mm) (length)
  • Fish whole body weight (to nearest whole gram), may have been taken when partially frozen (weight)
  • Weight of otolith (nearest 0.0001 gram) used for FT-NIRS scan (structure_weight)
  • otolith read age (read_age) in days representing an average of multiple reads (agreement <= 5%)
  • side of otolith scanned (side)
  • % of otolith with some sort of abnormality including crystalization or broken/chipped portions (percent_affected)
  • other problems include tissue stuck to otoliths (other_problem)
  • whether an otolith was broken or chipped at all (broken)
  • scan name/session or run number (scan_name, run_number) to indicate separate scans for the same specimen (LPW specimens scanned in triplicate)
  • Filename of .0 opus files for each specimen & scan session (file_name)

Grabbing filenames from .0 files for metadata .csv

To facilitate easier .0 file loading, all file names for .0 files should be added to the metadata file. All filenames in a folder can be generated with list.files(). For pulling file names to create a new .csv, you can use:
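A minimal sketch of that step, demonstrated on a throwaway folder so it runs anywhere; in practice, point `path` at the real scan directory (e.g. the LPW_scans folder used below). The `write.csv()` call is an optional shortcut that skips the copy/paste-into-Excel step described next.

```r
# List all Opus .0 files in a folder and write them to a one-column csv.
demo_dir <- tempfile("scans")
dir.create(demo_dir)
file.create(file.path(demo_dir, c("PACIFIC_COD_LPW202001202_1_OA1.0", "notes.txt")))
fnames <- list.files(path = demo_dir, pattern = "\\.0$") # .0 files only
write.csv(data.frame(file_name = fnames), file.path(demo_dir, "file_names.csv"), row.names = FALSE)
fnames # "PACIFIC_COD_LPW202001202_1_OA1.0"
```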

This output (without index numbers in square brackets) can be copied and pasted into Excel, then converted into columns using the “Data -> Text to Columns” function and selecting a “space” delimiter. These columns can then be transposed into a file_name column.

Reading in FT-NIRS spectra from a Bruker MPA II

First I import my metadata (LPW_metadata.csv) and convert the session_title (i.e. scan #) to a factor. LPW FT-NIRS scans were taken in triplicate (NIR_LPW202001202_otoliths, ...otoliths_rescan_1, ...otoliths_rescan_2) to see if spectra changed after storage. Metadata for LPW contained an additional entry (..._rescan_3) with some notes and comments, however I will only be importing the metadata for actual scans.

meta_LPW <- read.csv("LPW_metadata.csv") # import metadata
levels(as.factor(as.character(meta_LPW$session_title))) # list the unique scan sessions (three scans taken at separate times, plus a notes-only rescan_3 entry)
## [1] "NIR_LPW202001202_otoliths"          "NIR_LPW202001202_otoliths_rescan_1"
## [3] "NIR_LPW202001202_otoliths_rescan_2" "NIR_LPW202001202_otoliths_rescan_3"
meta_LPW <- meta_LPW[meta_LPW$session_title %in% c("NIR_LPW202001202_otoliths", "NIR_LPW202001202_otoliths_rescan_1", "NIR_LPW202001202_otoliths_rescan_2"), ] # only include scan sessions 1, 2 & 3
meta_LPW$specimen[meta_LPW$file_name == ""] # check for missing file_names in metadata
## [1] 66
# the first scan of specimen 66 is missing and will be excluded for analysis.

meta_LPW <- meta_LPW[meta_LPW$file_name != "", ] # remove any rows that are missing a file_name in the metadata

# check number of specimens for each scan session (122 specimens, 365 total scans after excluding specimen 66, scan 1)
nrow(meta_LPW[meta_LPW$session_title %in% c("NIR_LPW202001202_otoliths"), ])
## [1] 121
nrow(meta_LPW[meta_LPW$session_title %in% c("NIR_LPW202001202_otoliths_rescan_1"), ])
## [1] 122
nrow(meta_LPW[meta_LPW$session_title %in% c("NIR_LPW202001202_otoliths_rescan_2"), ])
## [1] 122
# generate file paths to import with Opusreader
setwd("LPW_scans/") # setwd to folder containing FT-NIRS scans
meta_LPW$file_path <- paste0(getwd(), "/", meta_LPW$file_name) # generate filepaths for all .0 files
Opusfiles <- as.vector(meta_LPW$file_path) # store filepaths in object
meta_LPW$exists <- file.exists(Opusfiles) # flag whether each .0 file exists on disk (read_opus errors on missing files)
meta_LPW1 <- meta_LPW[meta_LPW$exists, ] # filter the file list and data by otoliths with spectra files
Opusfiles <- as.vector(meta_LPW1$file_path) # from Esther, "I repeated this and wrote over it so I wouldn't have extra files to read in that don't exist and produce an error"
rm(meta_LPW1)

Once all metadata has been imported into a dataframe with file paths for the .0 files, FT-NIRS scans can be imported with opusreader2's read_opus(). First a single file is read in to confirm the script runs and the output looks right; then all .0 files are imported using the Opusfiles object. read_opus() creates a large list of lists containing MPA II settings, metadata generated during scans, specimen names, etc. See ?read_opus from the opusreader2 package for more details on the list structure of .0 files.

# read a single file (one measurement) to check that script runs without failure
file <- Opusfiles[1]
data_list <- opusreader2::read_opus(dsn = file) # warnings about NAs come from a timestamp metadata field; they are expected and harmless (suppressWarnings() quiets them below)
rm(data_list)
rm(file)
SPCfiles_nooffset <- suppressWarnings(lapply(Opusfiles, opusreader2::read_opus)) # Read in all files in the Opusfiles list.  This gives an error if any file names or paths are wrong.
# uncomment to see additional output from .0 files

# str(SPCfiles_nooffset[[1]]) # check first element
# SPCfiles_nooffset[[1]][[1]]$ab$data # can see spc values this way I think
# SPCfiles_nooffset[[1]][[1]]$lab_and_process_param_raw$parameters #this has info about what setting was used (here otolith), species, and file name
SPCfiles_nooffset[[1]][[1]]$lab_and_process_param_raw$parameters$FC2$parameter_value # species
## [1] "PACIFIC_COD"
SPCfiles_nooffset[[1]][[1]]$lab_and_process_param_raw$parameters$FD1$parameter_value # unique ID, including project title (LPW2020), experimental designation (01 for experimental, mortalities are numbered 02), species code (202), specimen number (_1_) and scan number (OA1)
## [1] "LPW202001202_1_OA1"
# SPCfiles_nooffset[[1]][[1]]$ab$wavenumbers # wavenumbers scanned
SPCfiles_nooffset[[1]][[1]]$instrument_ref$parameters$INS$parameter_value # instrument name
## [1] "MPA II"

Once all .0 files and their embedded metadata have been read in, the FT-NIRS scan data can be extracted from the lists.

# extract absorbance measurements, wavenumbers scanned, unique specimen ID, species  & instrument name from .0 files list
spectra <- lapply(SPCfiles_nooffset, function(x) x[[1]]$ab$data) # FT-NIRS scan data, absorbance measurements at all 949 wavenumbers
wavenumber <- lapply(SPCfiles_nooffset, function(x) x[[1]]$ab$wavenumbers) # list of all wavenumbers scanned
file_id <- lapply(SPCfiles_nooffset, function(x) x[[1]]$lab_and_process_param_raw$parameters$FD1$parameter_value) # unique ID in Opus
species <- lapply(SPCfiles_nooffset, function(x) x[[1]]$lab_and_process_param_raw$parameters$FC2$parameter_value) # species
instrument <- lapply(SPCfiles_nooffset, function(x) x[[1]]$instrument_ref$parameters$INS$parameter_value) # instrument name
 # instrument names could differ if scan settings or light sources changed between sessions

# create dataframe out of list
spectra <- lapply(spectra, as.data.frame) 
# add metadata to dataframe
for (i in 1:length(spectra)) {
  colnames(spectra[[i]]) <- wavenumber[[i]] # assign wavenumbers to columns of absorbance measurements
  spectra[[i]]$species <- species[[i]]
  spectra[[i]]$file_id <- file_id[[i]]
  spectra[[i]]$instrument <- instrument[[i]]
  spectra[[i]]$file_path <- Opusfiles[[i]]
}

# extract file_name from file_path: test code on one file first
try <- spectra[[1]] # specimen 1, scan session 1 
splitstackshape::cSplit(as.data.frame(try$file_path), sep = "/", splitCols = "try$file_path", type.convert = FALSE) %>% select(tail(names(.), 1))
##                     try$file_path_9
##                              <char>
## 1: PACIFIC_COD_LPW202001202_1_OA1.0
rm(try)
# create file_name variable for all scans
file_name <- lapply(spectra, function(x) splitstackshape::cSplit(as.data.frame(x$file_path), sep = "/", splitCols = "x$file_path", type.convert = FALSE) %>% select(tail(names(.), 1)))
file_name[[1]][[1, 1]] # check that first element is correct
## [1] "PACIFIC_COD_LPW202001202_1_OA1.0"
# add file_name variable to spectra dataframe
for (i in 1:length(spectra)) {
  spectra[[i]]$file_name <- file_name[[i]][[1, 1]]
}
rm(file_id, file_name, instrument, species, wavenumber, Opusfiles, i)
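As a side note, base R's basename() extracts the same trailing path component without splitstackshape; a quick sketch (the example path is made up):

```r
# basename() drops everything up to and including the last "/" in a path.
fp <- "/home/user/LPW_scans/PACIFIC_COD_LPW202001202_1_OA1.0" # hypothetical path
basename(fp) # "PACIFIC_COD_LPW202001202_1_OA1.0"
```

The cSplit loop above could likely be replaced with something like spectra[[i]]$file_name <- basename(spectra[[i]]$file_path), though I have kept the original approach.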

Build final spectral dataframe for modelling and graphing

df <- as.data.frame(do.call(rbind, spectra)) # convert spectral list/dataframe to dataframe object
dfmeta_LPW <- dplyr::left_join(meta_LPW, df, by = c("file_name", "file_path")) # join metadata to spectral data
dfmeta_LPW <- dfmeta_LPW %>% select(-c("exists", "instrument")) # remove exists and instrument columns
rm(meta_LPW, df, spectra, SPCfiles_nooffset)
colnames(dfmeta_LPW) <- as.character(colnames(dfmeta_LPW)) # make sure column names are characters, not numeric
dfmeta_longLPW <- pivot_longer(dfmeta_LPW, cols = -c(1:20)) # pivot absorbance measurements to long format for easy plotting
dfmeta_longLPW <- dfmeta_longLPW %>% rename(., "wavenumber" = "name") # rename name column to wavenumber for clarification
dfmeta_longLPW$wavenumber <- as.numeric(as.character(dfmeta_longLPW$wavenumber)) # change class of wavenumber variable to a numeric

# Plot 7 specimens to see if data loaded properly.
ggplot() +
  geom_path(data = dfmeta_longLPW[dfmeta_longLPW$specimen %in% c(1, 10, 100, 101, 102, 103, 104), ], 
            aes(x = wavenumber, y = value, color = session_title, group = file_name), linewidth = .5) +
  scale_x_reverse() +
  labs(y = "Absorbance units", x = expression(paste("Wavenumber ", cm^-1))) +
  theme(
    axis.text = element_text(size = 10),
    axis.text.x = element_text(size = 12, angle = 25),
    axis.title = element_text(size = 12),
    legend.position = "none",
    strip.text = element_text(size = 14),
    legend.text = element_text(size = 10),
    legend.title = element_text(size = 12),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    axis.line = element_line(colour = "black")
  ) +
  facet_wrap(~specimen)

# Save RDS for data transparency and easily enable others to work with dataframe without all these complicated import steps
saveRDS(dfmeta_LPW, "RDS_dataframes/LPW_dfmeta.RDS")
saveRDS(dfmeta_longLPW, "RDS_dataframes/LPW_dfmeta_long.RDS")

LPW Data Exploration

Preliminary spectra and PCA plots were examined to determine whether spectra from the different scan sessions were similar enough to warrant combining for analyses. Previous research has shown that FT-NIRS spectra can change considerably over time, and the triplicate LPW scans span nearly a year between the first and third sessions.

# set run_number as factor for color groupings in plots
dfmeta_longLPW$run_number <- as.factor(dfmeta_longLPW$run_number)
dfmeta_LPW$run_number <- as.factor(dfmeta_LPW$run_number)
# Plot FT-NIRS scans, all 3 runs, colored by run_number
ggplot(dfmeta_longLPW) + geom_path(aes(x = wavenumber, y = value, 
                                       color = run_number, group = file_name),alpha = 0.5) + 
  labs(y = "Absorbance units", x = expression(paste("Wavenumber ", cm^-1)), color = "Run Number") +
  scale_color_viridis(discrete = T) + 
  scale_x_reverse() + # FT-NIRS spectra often displayed with reversed x-axis for otoliths - lower wavenumbers generally more informative and are more easily seen away from axes
  ggtitle("LPW FT-NIRS Spectra, All Scan Sessions")

# absorbance values appear to differ slightly between scans; compare runs 1 & 2 first
ggplot(dfmeta_longLPW[dfmeta_longLPW$run_number %in% c(1,2),]) + geom_path(aes(x = wavenumber, y = value, color = run_number, group = file_name)) + 
  labs(y = "Absorbance units", x = expression(paste("Wavenumber ", cm^-1)), color = "Run Number") +
  scale_color_viridis(discrete = T) + 
  scale_x_reverse() + 
  ggtitle("LPW FT-NIRS Spectra, Scans 1 & 2")

# Scans 1 & 3
ggplot(dfmeta_longLPW[dfmeta_longLPW$run_number %in% c(1,3),]) + geom_path(aes(x = wavenumber, y = value, color = run_number, group = file_name)) + 
  labs(y = "Absorbance units", x = expression(paste("Wavenumber ", cm^-1)), color = "Run Number") +
  scale_color_viridis(discrete = T) + 
  scale_x_reverse() + 
  ggtitle("LPW FT-NIRS Spectra, Scans 1 & 3")

# Scans 2 & 3
ggplot(dfmeta_longLPW[dfmeta_longLPW$run_number %in% c(2,3),]) + geom_path(aes(x = wavenumber, y = value, color = run_number, group = file_name)) + 
  labs(y = "Absorbance units", x = expression(paste("Wavenumber ", cm^-1)), color = "Run Number") +
  scale_color_viridis(discrete = T) + 
  scale_x_reverse() + 
  ggtitle("LPW FT-NIRS Spectra, Scans 2 & 3")

# Scan 1 looks quite different from scans 2 & 3.  These should be triplicate scans with near-identical values for each of the 122 specimens.  To confirm, I extract/plot the first 2 principal components for all FT-NIRS wavenumbers (columns 21:ncol(dfmeta_LPW)) and color points by run number.
pca_all_LPW <- pca(dfmeta_LPW[21:ncol(dfmeta_LPW)], scale = T)
pcs <- as.data.frame(cbind(pc2 = pca_all_LPW$calres$scores[,2], run_number = dfmeta_LPW$run_number)) # extract scores for PC2
pcs <- cbind(pc1 = pca_all_LPW$calres$scores[,1], pcs) # extract scores for PC1

# plot PC1 & PC2, colored by run_number
ggplot(pcs) + 
  geom_point(aes(x = pc1, y = pc2, color = as.factor(run_number)), size = 3) + 
  labs(x = paste("PC1 (", 
                 round(pca_all_LPW$calres$expvar[1], digits = 3), # variance explained by PC1
                 "% var. explained )"),
       y = paste("PC2 (", 
                 round(pca_all_LPW$calres$expvar[2], digits = 3), # variance explained by PC2
                 "% var. explained )"),
       color = "Run Number") +
  scale_color_viridis(discrete = T)

# run/scan #1 is confirmed to be quite different.  Something appears to have changed between scan periods, though exactly what is unknown.  All subsequent analyses will average the absorbance measurements from scans 2 & 3.

Combining/averaging LPW FT-NIRS scans 2 & 3

scan_2 <- dfmeta_LPW %>% dplyr::filter(run_number == 2)
# scan_2_long <- pivot_longer(scan_2, cols = `11536`:`3952`, names_to = "name", values_to = "value")
# scan_2_long$name <- as.numeric(scan_2_long$name)
scan_3 <- dfmeta_LPW %>% dplyr::filter(run_number == 3)
# scan_3_long <- pivot_longer(scan_3, cols = `11536`:`3952`, names_to = "name", values_to = "value")
# scan_3_long$name <- as.numeric(scan_3_long$name)
scan_avg <- bind_cols(scan_2[, 1:20], (scan_2[, 21:ncol(scan_2)] + scan_3[, 21:ncol(scan_3)]) / 2) # metadata from scan 2, element-wise mean of absorbances
rm(scan_2, scan_3)
scan_avg_long <- pivot_longer(scan_avg, cols = `11536`:`3952`, names_to = "name", values_to = "value")
scan_avg_long$name <- as.numeric(scan_avg_long$name)
saveRDS(scan_avg, "RDS_dataframes/LPW_scan_avg.RDS")
saveRDS(scan_avg_long, "RDS_dataframes/LPW_scan_avg_long.RDS")
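The element-wise average above relies on scan_2 and scan_3 being row-aligned by specimen. A self-contained toy version of the same pattern with an explicit alignment guard (the data and wavenumber column here are made up):

```r
# Toy scan-averaging with a row-alignment check before the element-wise mean.
scan_2 <- data.frame(specimen = 1:3, `4000` = c(0.10, 0.20, 0.30), check.names = FALSE)
scan_3 <- data.frame(specimen = 1:3, `4000` = c(0.12, 0.18, 0.34), check.names = FALSE)
stopifnot(identical(scan_2$specimen, scan_3$specimen)) # rows must pair up by specimen
scan_avg_toy <- cbind(scan_2[, 1, drop = FALSE],
                      (scan_2[, -1, drop = FALSE] + scan_3[, -1, drop = FALSE]) / 2)
scan_avg_toy$`4000` # 0.11 0.19 0.32
```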

Preprocessing Spectra

Spectral preprocessing removes unwanted noise from absorbance measurements, filters out NIR backscatter, and can improve model fits by increasing the between-specimen variability of absorbance measurements at informative wavenumbers. Initial preprocessing uses prospectr and a Savitzky-Golay filter. Some functions are included below to experiment with preprocessing parameters; I will stick with one method and explore others later if time allows.
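As intuition for what the Savitzky-Golay filter does, here is a base-R sketch of a first-derivative pass that fits a local polynomial in each window and takes its slope at the window centre. This is illustration only; the analysis itself uses prospectr::savitzkyGolay(), which computes the same thing via convolution coefficients.

```r
# Illustration only: local-polynomial first derivative, the idea behind an SG filter.
sg_deriv1 <- function(y, p = 2, w = 5) {
  stopifnot(w %% 2 == 1, p < w)
  half <- (w - 1) / 2
  t <- -half:half                          # centred positions in the window
  out <- rep(NA_real_, length(y))
  for (i in (half + 1):(length(y) - half)) {
    fit <- lm(y[(i - half):(i + half)] ~ poly(t, p, raw = TRUE))
    out[i] <- coef(fit)[2]                 # slope at centre = 1st derivative
  }
  out[!is.na(out)]                         # edge points are dropped, like savitzkyGolay()
}
y <- 2 * (1:20)        # linear "spectrum" with slope 2
d <- sg_deriv1(y)      # 16 values (w - 1 = 4 edge points lost)
all(abs(d - 2) < 1e-8) # TRUE: the derivative recovers the slope
```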

# ggplot function to plot spectra, colored by some factor
spec.fig <- function(mydf, color) { 
  #color <- as.character({{color}})
  ggplot(mydf) +
  geom_path(aes(x = name, y = value, color = {{ color }}, group = file_name)) +
  scale_x_reverse() +
  scale_color_viridis() + 
  labs(y = "Preprocessed absorbance", x = expression(paste("Wavenumber ", cm^-1)))
  #theme(axis.text = element_text(size = 16),
    # axis.text.x = element_text(size = 12, angle = 25),
    # axis.title = element_text(size = 14),
    # legend.position = "right",
    # strip.text = element_text(size = 14),
    # legend.text = element_text(size = 14),
    # legend.title = element_text(size = 14),
    # panel.grid.major = element_blank(),
    # panel.grid.minor = element_blank(),
    # panel.background = element_blank(),
    # axis.line = element_line(colour = "black")
  #)
}
spec.fig(scan_avg_long,length)

# function to plot results of different savitzkyGolay filter params
sg_plotting <- function(color, m, p, w) { 
  dftempproc <- as.data.frame(
    cbind(scan_avg[, c(1:20)],
          savitzkyGolay(scan_avg[,21:length(scan_avg)],
                        m = m, p = p, w = w)
  ))
  dftempproc_long <- tidyr::pivot_longer(dftempproc, cols = c(21:length(dftempproc)))
  dftempproc_long$name <- as.numeric(as.character(dftempproc_long$name))
  spec.fig(mydf = dftempproc_long, color = {{ color }}) + 
    ggtitle(paste("diff =", m, "poly =", p, "window =", w))
}
sg_plotting(specimen,1,3,15)

sg_plotting(specimen,1,2,15)

# function to apply savitzkyGolay filter with selected parameters to dataframe and create new temp_proc and temp_proc_long dataframes for modelling
quickproc <- function(df, m, p, w){
  temp_proc <<- as.data.frame(
    cbind({{df}}[, c(1:20)],
          savitzkyGolay({{df}}[,21:length({{df}})],
                        m = m, p = p, w = w)
  ))
  temp_proc_long <<- tidyr::pivot_longer(temp_proc, cols = c(21:length(temp_proc)))
  temp_proc_long$name <<- as.numeric(as.character(temp_proc_long$name))
}

Variable/Wavelength Selection & PLS on length as a test

Using the temp_proc df produced by the quickproc() function above. VIP scores are used to select the wavelengths most informative to the model.
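For reference, the VIP scores here come from mdatools::vipscores(); the underlying formula can be sketched in base R. The toy weight matrix W and per-component explained y-variance ssy below are made up for illustration:

```r
# VIP_j = sqrt( p * sum_a ssy_a * (w_ja / ||w_a||)^2 / sum_a ssy_a )
vip <- function(W, ssy) {
  p <- nrow(W)                               # number of predictors (wavenumbers)
  Wnorm2 <- sweep(W^2, 2, colSums(W^2), "/") # (w_ja / ||w_a||)^2
  sqrt(p * (Wnorm2 %*% ssy) / sum(ssy))
}
W <- matrix(c(1, 1, 0, 0,                    # toy X-weights: 4 predictors,
              0, 0, 1, 1), ncol = 2)         # 2 components
ssy <- c(0.7, 0.3)                           # y-variance explained per component
v <- vip(W, ssy)
mean(v^2)                                    # ~1: squared VIPs always average to 1
```

The "VIP > 1" cutoff used below follows from this property: since the squared scores average to 1, predictors above 1 are more informative than average.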

quickproc(scan_avg,1,3,17) # SG filter of scan_avg to generate temp_proc df for models below

test_pls_length <- pls(temp_proc[, 21:ncol(temp_proc)], temp_proc[, 2],
  scale = T, center = F, info = "Length Prediction Model",
  cv = 1
)

# pls using only wavenumbers with VIP > 1
length_VIP <- vipscores(test_pls_length) # extract VIP values for wavenumbers of above pls model
plotVIPScores(test_pls_length)

test_waves <- names(subset(length_VIP[,1], length_VIP[,1] > 1)) 
all_waves <- names(temp_proc[,21:length(temp_proc)])
bad_waves <- all_waves[!(all_waves %in% test_waves)]
test_pls_length_VIP <- pls(temp_proc[,21:ncol(temp_proc)], temp_proc[,2],
  scale = T, center = F, info = "Length Prediction Model",
  exclcols = bad_waves,
  cv = 1
)

# unprocessed df PLS
test_pls_length_unproc <- pls(scan_avg[, 21:ncol(scan_avg)], scan_avg[, 2],
  scale = T, center = F, info = "Length Prediction Model",
  cv = 1
)


# 90% train, 10% test PLS, no VIP
set.seed(1)
idx <- floor(nrow(temp_proc) * 0.9)
train_idx <- sample(seq_len(nrow(temp_proc)), size = idx)
train <- temp_proc[train_idx, -c(1,3:20)]
test <- temp_proc[-train_idx, -c(1,3:20)]
test_pls_length_split <- pls(train[,-c(1)], train[,1],
  scale = T, center = F,
  info = "Length Prediction Model", cv = 1,
  x.test = test[,-c(1)], y.test = test[,1]
)

# 90% train, 10% test PLS, VIP
test_pls_length_split_VIP <- pls(train[,-c(1)], train[,1],
  scale = T, center = F,
  info = "Length Prediction Model", cv = 1,
  x.test = test[,-c(1)], y.test = test[,1],
  exclcols = bad_waves
)

plot(test_pls_length)

plot(test_pls_length_VIP)

plot(test_pls_length_unproc)

plot(test_pls_length_split)

plot(test_pls_length_split_VIP)

test_pls_length$res$cal$rmse[2]
## [1] 10.53403
test_pls_length_VIP$res$cal$rmse[3]
## [1] 9.619934
test_pls_length_unproc$res$cal$rmse[3]
## [1] 10.07586
test_pls_length_split$res$test$rmse[2]
## [1] 7.183961
test_pls_length_split_VIP$res$test$rmse[3]
## [1] 7.118468

MLR & PCA using length

pca_avg_LPW <- pca(scan_avg[21:ncol(scan_avg)], scale = T) # PCA for scan_avg
plot(pca_avg_LPW)

mlr_pcs <- pca_avg_LPW$calres$scores[, 1:20] # extract PCs
mlr_pcs <- cbind(mlr_pcs[, 1:5], scan_avg[, 1:4]) # keep first 5 PCs plus first 4 metadata columns
mlr_length <- lm(data = mlr_pcs, length ~ `Comp 1` + `Comp 2` + `Comp 3` + `Comp 4` + `Comp 5`)
summary(mlr_length) # only appear to need first 2 comps
## 
## Call:
## lm(formula = length ~ `Comp 1` + `Comp 2` + `Comp 3` + `Comp 4` + 
##     `Comp 5`, data = mlr_pcs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.944  -5.600   0.456   6.679  23.577 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 123.27869    0.95139 129.577  < 2e-16 ***
## `Comp 1`      2.36945    0.09463  25.040  < 2e-16 ***
## `Comp 2`     43.68759    5.57116   7.842 2.39e-12 ***
## `Comp 3`    -18.43889   14.66028  -1.258    0.211    
## `Comp 4`     -6.00035   15.29366  -0.392    0.696    
## `Comp 5`     -7.21902   17.31533  -0.417    0.678    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.51 on 116 degrees of freedom
## Multiple R-squared:  0.8562, Adjusted R-squared:   0.85 
## F-statistic: 138.1 on 5 and 116 DF,  p-value: < 2.2e-16
mlr_length <- update(mlr_length, length ~ `Comp 1` + `Comp 2`)
summary(mlr_length)
## 
## Call:
## lm(formula = length ~ `Comp 1` + `Comp 2`, data = mlr_pcs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.161  -5.205   0.403   5.730  23.797 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 123.27869    0.94702 130.175  < 2e-16 ***
## `Comp 1`      2.36945    0.09419  25.155  < 2e-16 ***
## `Comp 2`     43.68759    5.54558   7.878 1.75e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.46 on 119 degrees of freedom
## Multiple R-squared:  0.8538, Adjusted R-squared:  0.8513 
## F-statistic: 347.4 on 2 and 119 DF,  p-value: < 2.2e-16
# MLR with split
set.seed(1)
idx <- floor(nrow(scan_avg) * 0.9) # 90% of indices needed for training set
train_idx <- sample(seq_len(nrow(scan_avg)), size = idx) # generate indices for training set
train <- mlr_pcs[train_idx, -c(6,8:9)] # keep the 5 PCs and length only
test <- mlr_pcs[-train_idx, -c(6,8:9)]
mlr_length_split <- lm(data = train, length ~ `Comp 1` + `Comp 2`)
summary(mlr_length_split)
## 
## Call:
## lm(formula = length ~ `Comp 1` + `Comp 2`, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.421  -5.305   0.432   6.022  23.753 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  123.327      1.031 119.593  < 2e-16 ***
## `Comp 1`       2.337      0.102  22.912  < 2e-16 ***
## `Comp 2`      45.640      6.275   7.274 6.31e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.76 on 106 degrees of freedom
## Multiple R-squared:  0.8433, Adjusted R-squared:  0.8403 
## F-statistic: 285.2 on 2 and 106 DF,  p-value: < 2.2e-16
mlr_length_split_pred <- predict(mlr_length_split,test)
ggplot() +  # fitted values vs actual, out of sample prediction in red
  geom_point(aes(train$length,mlr_length_split$fitted.values)) + 
  geom_point(aes(test$length,mlr_length_split_pred),col = "red") + 
  geom_abline(slope=1, intercept = 0, col = "red")

# Split for PCR
set.seed(1)
idx <- floor(nrow(scan_avg) * 0.9) # 90% of indices needed for training set
train_idx <- sample(seq_len(nrow(scan_avg)), size = idx) # generate indices for training set
train <- scan_avg[train_idx, -c(1, 3:20)] # length and wavenumber measurement columns only
test <- scan_avg[-train_idx, -c(1, 3:20)]

# PCR model 
pcr_length_model <- pls::pcr(length ~ .,
  data = train,
  scale = T,
  validation = "CV",
  ncomp = 10
) # look at 10 components max
selectNcomp(pcr_length_model, "onesigma", plot = TRUE) # determine appropriate number of comps

## [1] 1
scoreplot(pcr_length_model) # plot of first 2 PCs

loadingplot(pcr_length_model)

pcr_length_pred <- predict(pcr_length_model, newdata = test, ncomp = 2) # fit model to test data, store fitted values
ggplot() +
  geom_point(aes(x = train$length, y = pcr_length_model$fitted.values[, , 2])) + # extract fitted values of model with 2 PCs.
  geom_point(aes(x = test$length, y = pcr_length_pred), col = "red") +
  geom_abline(slope = 1, col = "red")

# PLS for Age - Create age DF (remove specimens missing age data)

# df with only aged specimens
age_only <- scan_avg[complete.cases(scan_avg$read_age),]
# generate preprocessed df of ages, temp_proc & temp_proc_long
quickproc(age_only,2,3,19)
# test pls, no variable selection
test_pls_age <- pls(temp_proc[, 21:ncol(temp_proc)], temp_proc[, 11],
  scale = F, center = T, info = "Age Prediction Model",
  cv = 1
)

# pls with VIP > 1
test_vip_age <- vipscores(test_pls_age)
plotVIPScores(test_pls_age)

good_waves <- names(subset(test_vip_age[,1], test_vip_age[,1] > 1)) # keep as character so names match all_waves below
all_waves <- names(temp_proc[,21:length(temp_proc)])
bad_waves <- all_waves[!(all_waves %in% good_waves)]

test_pls_age_VIP <- pls(temp_proc[, 21:ncol(temp_proc)], temp_proc[, 11],
 scale = F, center = T, info = "Age Prediction Model",
 cv = 1, exclcols = bad_waves
)

# 90/10 split for PLS, no VIP
set.seed(1)
idx <- floor(nrow(temp_proc) * 0.9)
train_idx <- sample(seq_len(nrow(temp_proc)), size = idx)
train <- temp_proc[train_idx, -c(1:10,12:20)]
test <- temp_proc[-train_idx, -c(1:10,12:20)]

test_pls_age_split <- pls(train[,-1], train[,1],
  scale = F, center = T,
  info = "Age Prediction Model", cv = 1,
  x.test = test[,-1], y.test = test[,1]
)

# 90/10 split for PLS, VIP > 1
test_pls_age_split_VIP <- pls(train[,-1], train[,1],
  scale = F, center = T,
  info = "Age Prediction Model", cv = 1,
  x.test = test[,-c(1)], y.test = test[,1],
  exclcols = bad_waves
)

plot(test_pls_age)

plot(test_pls_age_VIP)

plot(test_pls_age_split)

plot(test_pls_age_split_VIP)

test_pls_age$res$cal$rmse[3]
## [1] 10.45383
test_pls_age_VIP$res$cal$rmse[2]
## [1] 14.37784
test_pls_age_split$res$test$rmse[3]
## [1] 17.03094
test_pls_age_split_VIP$res$cal$rmse[2]
## [1] 14.31215

PCR for age

quickproc(age_only,2,3,19)
pca_age_LPW <- pca(temp_proc[21:ncol(temp_proc)], scale = T)
plot(pca_age_LPW)

mlr_age_pcs <- pca_age_LPW$calres$scores[,1:20]
mlr_age_pcs <- cbind(mlr_age_pcs[,1:10], temp_proc[,1:11]) # keep first 10 PCs plus metadata columns (incl. read_age)
mlr_age <- lm(data = mlr_age_pcs, read_age ~ `Comp 1` + `Comp 2` + `Comp 3` + `Comp 4` + `Comp 5`)
summary(mlr_age)
## 
## Call:
## lm(formula = read_age ~ `Comp 1` + `Comp 2` + `Comp 3` + `Comp 4` + 
##     `Comp 5`, data = mlr_age_pcs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.820 -21.264  -7.735  17.378  90.137 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 156.2807     4.0967  38.148   <2e-16 ***
## `Comp 1`      1.4109     1.2503   1.128   0.2644    
## `Comp 2`     -2.6409     1.4764  -1.789   0.0796 .  
## `Comp 3`     -3.2187     1.8059  -1.782   0.0806 .  
## `Comp 4`     -0.8587     1.9857  -0.432   0.6673    
## `Comp 5`     -4.4479     2.6712  -1.665   0.1020    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.93 on 51 degrees of freedom
## Multiple R-squared:  0.1722, Adjusted R-squared:  0.09105 
## F-statistic: 2.122 on 5 and 51 DF,  p-value: 0.0777
# MLR with split
set.seed(1)
idx <- floor(nrow(age_only) * 0.9) # 90% of indices needed for training set
train_idx <- sample(seq_len(nrow(age_only)), size = idx) # generate indices for training set
train <- mlr_age_pcs[train_idx, -c(11:20)] # keep the 10 PCs and read_age only
test <- mlr_age_pcs[-train_idx, -c(11:20)]
mlr_age_split <- lm(data = train, read_age ~`Comp 1` + `Comp 2` + `Comp 3` + `Comp 4` + `Comp 5`)
summary(mlr_age_split)
## 
## Call:
## lm(formula = read_age ~ `Comp 1` + `Comp 2` + `Comp 3` + `Comp 4` + 
##     `Comp 5`, data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41.73 -21.14  -8.58  16.45  91.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 154.9436     4.3548  35.580   <2e-16 ***
## `Comp 1`      0.7844     1.3129   0.597   0.5532    
## `Comp 2`     -3.2988     1.5725  -2.098   0.0416 *  
## `Comp 3`     -2.8779     1.8872  -1.525   0.1343    
## `Comp 4`     -0.3173     2.1300  -0.149   0.8822    
## `Comp 5`     -5.3221     2.7197  -1.957   0.0566 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.84 on 45 degrees of freedom
## Multiple R-squared:  0.1928, Adjusted R-squared:  0.1031 
## F-statistic:  2.15 on 5 and 45 DF,  p-value: 0.07666
mlr_age_split_pred <- predict(mlr_age_split,test)
ggplot() +  # fitted values vs actual, out of sample prediction in red
  geom_point(aes(train$read_age,mlr_age_split$fitted.values)) + 
  geom_point(aes(test$read_age,mlr_age_split_pred),col = "red") + 
  geom_abline(slope=1, intercept = 0, col = "red")

# Split for PCR
set.seed(1)
idx <- floor(nrow(age_only) * 0.9) # 90% of indices needed for training set
train_idx <- sample(seq_len(nrow(age_only)), size = idx) # generate indices for training set
train <- age_only[train_idx, -c(1:10,12:20)] # read_age and unprocessed wavenumber columns only
test <- age_only[-train_idx, -c(1:10,12:20)]

pcr_age_model <- pcr(read_age ~ .,
                 data = train, 
                 scale = T,
                 validation = "CV",
                 ncomp = 10)
selectNcomp(pcr_age_model, "onesigma", plot = TRUE) # determine appropriate number of comps

## [1] 1
scoreplot(pcr_age_model) # plot of first 2 PCs

pcr_age_pred <- predict(pcr_age_model, newdata = test, ncomp = 4) # fit model to test data, store fitted values
ggplot() +
  geom_point(aes(x = train$read_age, y = pcr_age_model$fitted.values[, , 4])) + # extract fitted values of the 4-component model
  geom_point(aes(x = test$read_age, y = pcr_age_pred), col = "red") +
  geom_abline(slope = 1, col = "red")

RMSEP(pcr_age_model)
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           32.89    23.64    24.00    21.31    20.73    21.40    21.79
## adjCV        32.89    23.54    23.88    21.21    20.63    21.26    21.63
##        7 comps  8 comps  9 comps  10 comps
## CV       21.59    21.57    21.58     21.81
## adjCV    21.38    21.45    21.49     22.05